[Clang/TSAN bug, not wasm] Clang13 TSAN failure in wasm tiering
Categories
(Firefox Build System :: General, defect, P3)
Tracking
(Not tracked)
People
(Reporter: lth, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: leave-open)
Attachments
(2 files, 1 obsolete file)
Clang13 TSAN reports a race condition in our tiering code, see https://firefox-ci-tc.services.mozilla.com/tasks/B4sU3VtNR2264ZjrYSqWTA/runs/0/logs/public/logs/live.log#L26976. The information so far:
The stack traces of the two accesses point to a read from Code::bestTier() on the main thread and a write from UniquePtr::reset() called from Code::setTier2() on a background thread.
The offending variable appears to be the unsynchronized UniquePtr tier2_ in the Code object: offset 16 bytes from the start of a Code object malloc'd by the main thread in ModuleGenerator::finishModule.
Access to tier2_ is controlled by the Atomic<bool> hasTier2_ variable. While hasTier2_ is false, tier2_ shall not be read; once hasTier2_ is set, tier2_ shall not be written (except for destruction). hasTier2_ is sequentially consistent and writing it publishes the value of tier2_ to other threads; all bits of tier2_ written by or safely seen by the thread that sets hasTier2_ become visible to all other threads that subsequently read hasTier2_.
The tier2_ variable is written by UniquePtr::reset, called from setTier2(). Before this, setTier2() checks that hasTier2_ is false.
The tier2_ variable is read by bestTier(), but only after checking that hasTier2_ is true. It is also read by some other functions in WasmCode.cpp and they all check that hasTier2_ is true before they access tier2_, using either 'if (hasTier2()) ...' or 'hasTier2() && ...'. [There is an exception, namely codeTier(t), but investigation shows that this is not an issue, and it can be strengthened to the point where it is no longer an exception.]
The hasTier2_ variable transitions from false to true only as a result of a call to commitTier2(). There is a single call to commitTier2() in the code, and it textually follows the call to setTier2(), and both calls are from within Module::finishTier2(). (FWIW, neither setTier2() or commitTier2() appear to have been inlined into finishTier2().)
The only way this race could happen is if hasTier2_ becomes set after it is checked in setTier2() so a concurrent thread passes the tier 2 test in bestTier() and reads tier2_ around the same time as setTier2() is writing to tier2_.
All the failures (there are 193 reported in this particular run) fail in the same way. And all fail with --test-wasm-await-tier2, which is sensible - most of the time, this is the only way we get to test tiering on CI.
Various hypotheses:
- This is a huge number of race conditions for a test run and makes me suspicious that the analysis is not actually correct.
- We could be starting multiple background compilations, possibly as a result of the recent rejiggering of the helper threads system, this would explain th race for sure.
- It could also be a bug not in tiering but in our testing of tiering.
(There is a second bug here. In Code::commitTier2 there is an ordering bug: a MOZ_RELEASE_ASSERT reads tier2 data to check that it is indeed in place before setting the hasTier2_ variable. This is wrong because it is the writing of hasTier2_ that signals that it is ok to read tier2 data. The race is benign in the sense that it is the present thread that has created the tier2 data, and thus inconsistent data will never be read here. And it does not seem like this is the bug causing our current problems.)
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 1•4 years ago
|
||
I think the second bug above, starting multiple background compilations for the same module, is very unlikely. commitTier2 checks that tier2 has not already been committed with a release assert, and if we had multiple tier-up compilations going we would have crashed here a long time ago.
Reporter | ||
Comment 2•4 years ago
|
||
It is an invariant of Code::tier2_ that it is never read without the
atomic Code::hasTier2_ being true. This invariant was violated in a
couple of harmless cases, where code on the thread that had just
written tier2_ (the committing background thread) would read tier2_
again before the tier-2 code was committed by hasTier2_ being written.
TODO: I added a protocol here for setTier2 to return a borrowed
pointer to the tier-2 code. But it's possible that the caller of
setTier2 should just extract that pointer from values it already has
in hand. Discuss.
Reporter | ||
Comment 3•4 years ago
|
||
I think we need to extract the machine code for Code::bestTier from the executable. We need to verify that there is no illegal reordering going on. It would be possible for the compiler to hoist the read of the tier2_ field above the guard without risking a crash, but only if it assumes that hasTier2() does not create an ordering relationship in the accesses. (That would be a compiler bug and thus is unlikely, but it would be good to rule it out.)
Reporter | ||
Comment 4•4 years ago
|
||
Disassembly below from the build artifact. Argument registers, in order: RDI, RSI, RDX, RCX, R8, R9. Nonvolatile registers: RBP, RBX, R12, R13, R14, R15. Code::bestTier() takes only a single argument, namely 'this', which is passed in RDI. It selects a CodeTier t and then calls t->tier(), which first reads t->segment_ and then does stuff with that.
The disassembly shows that while the actual reading of tier2_ is guarded by a branch on the atomic value, the TSAN flagging of the read of tier2_ has been hoisted to before the conditional branch and is therefore incorrect, ie this is a TSAN/Clang bug. Andi, can you please move this bug to the appropriate Bugzilla component? I'll look into cleaning up the attached patch and landing it, but likely under a different bug.
(gdb) disassemble Code::bestTier
push %rbp
mov %rsp,%rbp
push %r15 ;; save
push %r14 ;; callee-
push %r12 ;; saves
push %rbx ;; registers
mov %rdi,%rbx ;; RBX = this
mov 0x8(%rbp),%rdi ;; RDI = <return address>
call 0x21a640 <__tsan_func_entry()> ;; flag fn entry
lea 0x18(%rbx),%rdi ;; RDI = &this->hasTier2_
mov $0x5,%esi ;; RSI = magic argument
call 0x1fdb40 <__tsan_atomic32_load()> ;; signal the atomic load
mov %eax,%r14d ;; R14 = <result of atomic load>
mov 0x18(%rbx),%eax ;; (void)atomic_load(this->hasTier2)
lea 0x10(%rbx),%r15 ;; R15 = &this->tier2_
lea 0x8(%rbx),%rdi ;; RDI = &this->tier1_
call 0x2141c0 <__tsan_read8()> ;; flag the read of tier1_
mov 0x8(%rbx),%r12 ;; R12 = this->tier1_
mov %r15,%rdi ;; RDI = &this->tier2_
call 0x2141c0 <__tsan_read8()> ;; flag the read
test %r14d,%r14d ;; test hasTier2_
je 0x12389ed L1 ;; and jump if false
mov 0x10(%rbx),%r12 ;; R12 = this->tier2_
L1:
lea 0x10(%r12),%rdi ;; RDI = &theTier->segment_
call 0x2141c0 <__tsan_read8()> ;; flag the read
mov 0x10(%r12),%rbx ;; RBX = theTier->segment_
lea 0x24(%rbx),%rdi ;; ... and so on
call 0x213a00 <__tsan_read4()>
mov 0x24(%rbx),%ebx
call 0x21a6d0 <__tsan_func_exit()>
mov %ebx,%eax
pop %rbx
pop %r12
pop %r14
pop %r15
pop %rbp
ret
Reporter | ||
Comment 5•4 years ago
|
||
(In fairness, it doesn't have to be a compiler bug, it could be a bug in how the atomicity of the operations in mfbt/Atomics.h is made visible to the compiler. But given that the TSAN call is hoisted above the condition, it looks more like a compiler bug than like anything else.)
Comment 6•4 years ago
|
||
Comment on attachment 9244170 [details]
WIP: Bug 1733908 - Strengthen invariant on Code::tier2_
Revision D127421 was moved to bug 1733970. Setting attachment 9244170 [details] to obsolete.
Reporter | ||
Comment 7•4 years ago
|
||
Not a wasm bug.
Updated•4 years ago
|
Comment 8•4 years ago
|
||
The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.
Updated•4 years ago
|
Comment 9•4 years ago
|
||
Looks like there are a couple of impacted sites - these temporary suppressions should clear up the tests until they can be addressed.
Updated•4 years ago
|
Comment 10•4 years ago
|
||
Updated•4 years ago
|
Comment 11•4 years ago
|
||
Try pushes are showing one more data race that I didn't catch.
Comment 12•4 years ago
|
||
bugherder |
Comment 13•4 years ago
|
||
Comment 14•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Comment 15•4 years ago
|
||
Anything that the upcoming Clang 13.0.1 and WASI-SDK 14 can remedy here?
Comment 16•3 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:ahochheiden, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.
Comment 17•3 years ago
|
||
I'm assuming this still needs to stay open, but I'm not sure.
:Decoder, you own the parent tsan
bug, what do you think?
Comment 18•2 years ago
|
||
The leave-open keyword is there and there is no activity for 6 months.
:ahochheiden, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.
Comment 20•2 years ago
|
||
As long as we have a suppression for this in mozglue/build/TsanOptions.cpp
, this bug needs to stay open (we could check if we still need it with Clang 15).
Description
•